DOCUMENT RESUME

ED 227 150                                TM 830 161

AUTHOR          Bliss, Leonard B.
TITLE           Validation of the Use of the Stanford Achievement Test with U.S. V.I. Students. Virgin Islands of the United States Public School Basic Skills Assessment Survey, Technical Report No. 1.
INSTITUTION     College of the Virgin Islands, St. Thomas. Caribbean Research Inst.
PUB DATE        Jan 82
NOTE            54p.; Paper presented at the Annual Meeting of the American Educational Research Association (66th, New York, NY, March 19-23, 1982).
AVAILABLE FROM  Caribbean Research Institute, College of the Virgin Islands, St. Thomas, USVI 00802 ($2.00).
PUB TYPE        Speeches/Conference Papers (150) -- Reports - Research/Technical (143)
EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     *Achievement Tests; *Basic Skills; Data Analysis; Educational Assessment; Elementary Secondary Education; Language Skills; Mathematics Achievement; Reading Achievement; Sampling; *Standardized Tests; *State Programs; *Test Reliability; *Test Validity
IDENTIFIERS     *Stanford Achievement Tests; Virgin Islands

ABSTRACT 

A sample of slightly over 1500 students was drawn from even-numbered grades in public schools of the U.S. Virgin Islands, and was given the 1973 edition of the Stanford Achievement Test (in grades 2, 4, 6, and 8) and the Test of Academic Skills (grades 10 and 12) to assess student academic achievement in the basic skill areas of mathematics, reading, and English language. This report describes phase I of the data analysis, which involved the determination of levels of content validity and reliability of the scores obtained from these Virgin Islands students on these tests, which were originally standardized on continental United States populations. The results indicate that the tests are content valid for use in Virgin Islands public schools at these grade levels and that the scores obtained are at least as reliable as those obtained using continental U.S. students during the test standardization procedures. (Author/PN)



***********************************************************************
Reproductions supplied by EDRS are the best that can be made from the original document.
***********************************************************************




U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or organization originating it.
Minor changes have been made to improve reproduction quality.
Points of view or opinions stated in this document do not necessarily represent official NIE position or policy.



Virgin Islands of the United States 
Public School Basic Skills 
Achievement Survey 

Technical Report #1: 
Validation of the Use of the Stanford
Achievement Test With U.S. V.I. Students



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



[signature illegible]



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC) " 



PRESENTED AT THE ANNUAL MEETING
OF THE AMERICAN EDUCATIONAL
RESEARCH ASSOCIATION
NEW YORK CITY, MARCH 19-23, 1982



Caribbean Research Institute, College of the Virgin Islands
Leonard B. Bliss, Ph.D. - Principal Investigator

January 1982 



> PREFACE 

With the appearance of Virgin Islands of the United States Public School Basic Skills Achievement Survey, Technical Report #1: Validation of the Use of the Stanford Achievement Test With U.S. V.I. Students the Institute has embarked on the publication of a Working Paper Series. These papers are intended to present the author's (and the Institute's) point of view on various subjects as a matter for discussion and comment by those who agree as well as disagree with expressed positions. In this way the Institute hopes that the final versions will be improved in style as well as rigour.

The present paper is the first phase of a study of basic skills in the schools of the United States Virgin Islands requested by the Board of Trustees of the College. The work has taken considerably longer than anticipated due to fundamental alterations in the design so as to provide greater depth than originally planned. Unfortunately, shortage of staff did not allow the progress hoped for to be made.

The data for the whole project have been collected, however, and work is proceeding on their interpretation and the compiling of the three reports which will follow.

Norwell Harrigan 
Director 




Abstract 

A sample of slightly over 1500 students was drawn from even numbered grades in public schools of the U.S. Virgin Islands and was given the 1973 edition of the Stanford Achievement Test (in grades 2, 4, 6, and 8) and the Test of Academic Skills (grades 10 and 12) in an attempt to assess student academic achievement in the basic skill areas of mathematics, reading, and English language. This report describes phase I of the data analysis, which involved the determination of levels of content validity and reliability of the scores obtained from these Virgin Islands students on these tests, which were originally standardized on continental United States populations.

The results indicate that the tests are content valid for use in Virgin Islands public schools at all of these grade levels and that the scores obtained are at least as reliable as those obtained using continental U.S. students during the test standardization procedures.




It is almost becoming a matter of faith that achievement in basic skills (i.e. English language and mathematics) in public schools under the American flag has deteriorated over the last twenty years. Proponents of this idea point to evidence as formal as decreases in typical scores on the Scholastic Aptitude Test and standardized tests of academic achievement and as informal as the quality of writing and arithmetic skills they perceive in the young people around them.

The reactions of people to this perceived phenomenon are also varied. On the government level they include the requirement that all students score a minimum grade on a test of basic skills in order to receive a high school diploma; that teachers pass a similar test to obtain teacher certification; and that schools require students to take additional course work in basic skills areas. In addition, federal, state, and local governments have initiated programs to provide support in the forms of grants and technical assistance to schools at all levels to do research and set up programs designed to improve student achievement in basic skills.

At a different level, parents, concerned that the public schools are not doing an adequate job in preparing their children in basic skills areas, are choosing, in increasing numbers, to remove their children from public schools and place them in religious and secular private schools. While there are other reasons for the proliferation of private schools besides the purely academic, the desire for high quality academic preparation is one compelling cause of this phenomenon.

The public schools, themselves, have reacted strongly to this crisis in public confidence. These reactions include an increase in required courses in language and mathematics areas with a corresponding decrease in electives in areas considered less "basic." Projects to revise curricula in basic skills areas proliferate and are receiving more support than they have since the reevaluation of American education engendered by the shock of Sputnik in the late 1950's.

Improving basic skills achievement was a concern of the Department of Education of the government of the Virgin Islands of the United States when it approached the College of the Virgin Islands to provide aid in improving such instruction. In an effort to provide this service, the Caribbean Research Institute (CRI), the college's research arm, worked with a task force composed of representatives from the Department of Education and CRI to determine a course of action.

It became clear after the first few task force meetings that development of any strategy designed to improve basic skills achievement needed to start off with a fairly detailed description of current achievement levels of students in territorial public schools. This information was not available.

Public school students were administered a standardized achievement test only at the end of sixth grade (The Iowa Test of Basic Skills). In other elementary grades most students were tested annually or semiannually at their schools, but the tests given and the times during the academic year at which they were administered varied greatly, apparently at the whim of building administrators. The results of these tests stayed at the schools and were not collected at any central point. On the secondary level there was no program of standardized achievement testing.

An additional factor which limited the use of previously collected achievement level data was that all scores were reported in a norm referenced manner. That is, scores did not indicate which basic skills examinees had or lacked, but rather how examinees' scores compared to those obtained by a group of students to whom the tests were previously administered in the continental United States. The Iowa Test of Basic Skills administered to sixth graders did make comparisons with other V.I. sixth grade students (i.e. they reported using local norms), but even these were of no use in determining whether or not individual students had attained specific basic skills.

It was decided to test a representative sample of U.S. Virgin Islands public school students using a standardized basic skills battery. In choosing the test, the following criteria were used:

1) The test must be technically sound in terms of reliability and item discrimination, at least for the group it had been field tested on.

2) The test must be content valid for U.S. Virgin Islands public school students. That is, there needed to be a high degree of matching between the content and behaviors sampled by the test and those actually in the curriculum taught at various levels in the U.S. V.I. public schools.

3) The test must include a detailed statement of the objectives tested while providing an item by objective keying procedure.

4) Scores which indicate students' performances relative to each objective must be available. That is, criterion-referenced scoring must be provided.

The 1973 version of the Stanford Achievement Test (Basic Battery) was chosen as the test which appeared to meet the criteria listed above. It was administered to slightly over 1500 students in the Fall of 1980 in both the St. Thomas/St. John and the St. Croix school districts. This is the first of a series of research reports designed to make available the results of this rather complex study. A simple, brief example of the quantity of data obtained may serve to highlight the scope of this study. The Intermediate Level II of the Stanford Achievement Test (administered to sixth graders in this study) contained 351 items. It was administered to 225 students in the U.S. Virgin Islands sample, yielding 78,975 individual pieces of data. The sixth grade sample, due to a technical difficulty (the principal in one school forgot to assign the teacher of the selected class the task of giving the test and the teacher in another school administered only four of the seven subtests), contained the smallest number of examinees of any grade level. Additional reports will be issued regularly as soon as results become available.



Validity and Reliability of Test Scores

This first report deals with the establishment of the validity and reliability of the test scores. Validity refers to the extent to which the test measures those characteristics which it is intended to measure. Ebel (1961) has referred to validity as "one of the major deities in the pantheon of the psychometrician" (p. 640). Three types are now commonly used in educational and psychological measurement (see French and Michael, 1966). These are content, criterion-related, and construct validity.

Gronlund (1976) indicates that, "Criterion-related validity may be defined as the extent to which test performance is related to some other valued measure of performance" (p. 83). This may be performance on a task in the future (i.e. predictive validity) or on some present objectives not directly measured by the test (i.e. concurrent validity). Since the purpose of administering an achievement test is to get a direct measure of present student mastery of certain academic objectives (i.e. there is no attempt to predict future performance or to infer performance levels on objectives not directly measured by the test), criterion-related validity is not an issue in determining the appropriateness of the Stanford Achievement Test in measuring academic achievement in this study.

The term "construct validity" was first introduced into the area of psychometrics by Cronbach and Meehl (1955), who defined a construct as a postulated (that is, assumed or hypothetical) attribute of people that underlies and determines their overt behavior. If the behavior can be directly observed, or if the trait can be operationally defined, it is not a construct in this sense. Ebel (1979) notes

    Most of what we teach in educational institutions are knowledges, skills, and abilities. These can all be defined operationally. They are not hypothetical constructs. Ability to type, to spell, to weld, ability to solve problems with algebra, calculus, or computers; these are not the kind of latent traits Cronbach and Meehl had in mind. We would speak more sensibly, I think, if we did not call them constructs. (p. 307)

Construct validity is concerned with whether or not a test is accurately measuring the construct it purports to measure. Since this study is operationally defining basic skills achievement as the performance of students on the Stanford Achievement Test, it is clear that no construct is being measured. Hence, construct validity will not be a concern in this report.

The content of any curriculum can be thought of as being composed of subject matter content and behavioral changes sought in students. For a test to be content valid it must provide results that are representative of the topics and behaviors we wish to measure. More formally, ". . . content validity may be defined as the extent to which a test measures a representative sample of the subject matter and the behavioral changes under consideration" (Gronlund, 1976, pp. 81-82). Effective strategies for determining content validity involve determining the objectives sampled by the test, and examining the curriculum to ascertain the degree of match between them. Achievement tests are primarily concerned with measuring the acquisition of certain skills and knowledges (objectives) by students at the time that the test is given. Thus, it is content validity that should be of prime concern in this study. Specifically, do the objectives tested by the Stanford Achievement Test correspond to those taught in the schools of the U.S. Virgin Islands?

Reliability deals with the consistency of the scores of a test over time and over different examinees. It is purely a statistical phenomenon and cannot be determined logically as can content validity. Furthermore, it is a function of the scores of the test rather than of the test itself. This means that a test may give highly reliable scores for one group of examinees, but result in lower reliability with another group. In essence, what we are concerned with is whether or not the test scores represent measures of the same traits each time the test is given.

It is important, then, that whatever measure of basic skills achievement is used be content valid for the curriculum used in Virgin Islands public schools and produce reliable scores when administered to Virgin Islands public school students.

Like any good commercially available standardized test, the 1973 edition of the Stanford Achievement Test was standardized on a large sample of students. The SAT Technical Data Report (1975) indicates that a sample of over 275,000 pupils from 109 school systems in 43 states in the United States made up the standardization samples used. Table 1 provides descriptions of these samples and how they compare with a description of the population of the continental United States. Content validity was established by curricular analysis using information from a large number of sources.

    Basic to the construction of a series of achievement tests is the identification of what is being taught in the schools across the nation. The most important sources for curricular analysis were (a) textbook series in various subject areas (including the preparation of detailed analysis of the content of the books most widely used in each field); (b) a wide variety of courses of study from individual school systems; (c) statements of objectives from various state and national committees, and the opinions of experts in various fields; and (d) the research literature pertaining to children's concepts, experience, and vocabulary. (Technical Data Report, p. 12)

The reliability of the scores of the standardization sample was determined by using the Kuder-Richardson Formula 20 and by calculating the standard error of measurement of the scores. Two measures of reliability were used since it is known that high homogeneity in tested groups will lower the reliability estimates obtained using the KR-20, but that this effect is dealt with in determining standard errors. In addition, the standard error of measurement is more meaningful in interpreting scores of individual students. With very few exceptions, the reliabilities obtained from the standardization samples ranged from .84 to .95 using the KR-20 formula.
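
The two statistics named above can be illustrated with a short sketch. This is not the publisher's scoring procedure; it is a minimal illustration that assumes each item is scored 1 (correct) or 0 (incorrect), that total scores are not all identical, and that the population variance of total scores is used. The function names `kr20` and `standard_error_of_measurement` are hypothetical. The formulas are the standard ones: KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of totals), and SEM = SD * sqrt(1 - reliability).

```python
from math import sqrt
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for rows of 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Assumes at least two items and non-identical total scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)  # population variance of total scores

    # Sum of p*q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


def standard_error_of_measurement(responses):
    """SEM = SD of total scores times sqrt(1 - reliability)."""
    totals = [sum(row) for row in responses]
    return sqrt(pvariance(totals)) * sqrt(1 - kr20(responses))
```

Because KR-20 divides by the variance of total scores, a very homogeneous group (small variance) yields a lower coefficient even when the test behaves identically, which is the reason the text gives for reporting the standard error of measurement alongside it.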

While the 1973 version of the Stanford Achievement Test appears to be educationally sound based on the standardization



Table 1

Summary of Characteristics of Standardization Samples¹

Characteristics                            Stanford       Stanford
                                           Population     Range

Percent of pupils by community size
  0-49,999                                 70.0
  50,000-249,999                           14.2
  250,000 or more                          15.9

Percent of pupils by geographic region
  Southeast                                23.8
  North Central                            21.6
  Northeast                                26.5
  West                                     28.2

Median Family Income                       $9,096         $4,878 to $13,593

Median Years of Schooling
  (Adults 25 yrs. & older)                 12.1           8.4 to 12.6

Average Class Size
  (Student-Teacher Ratio)                  26.4           [illegible] to 36

Average Starting Salary of Teachers        $7,116         $4,500 to [illegible]

Average Salary of Teachers                 $9,360         $4,500 to $11,500

Median Years Teaching Experience           10.8           5 to 24

Percent of Grade 1 pupils who
  attended kindergarten                    84.6           0 to 100

Percent of Schools Using Some
  Team Teaching                            67.1


Table 1 continued

Characteristics                            Stanford       Stanford      National U.S. Population
                                           Population     Range         (1970 Data)

Percent of Schools Using Some
  Teacher Aids                             97.5

Percent of Pupils Not Promoted to
Next Highest Grade
  Grade 1                                  3.9            0.0 to 25
  Grade 2                                  1.8            0.0 to 15
  Grade 3                                  1.5            0.0 to 10
  Grade 4                                  0.9            0.0 to 10
  Grade 5                                  0.8            0.0 to 5
  Grade 6                                  1.1            0.0 to 5
  Grade 7                                  1.2            0.0 to 11
  Grade 8                                  1.3            0.0 to 9
  Grade 9                                  2.4            0.0 to 9

Percent of Pupils in
  Non-public Schools                                                    12

Percent of Major Ethnic Minorities
  Blacks                                   11.6           0 to 60       11.1
  Hispanics                                4.6            0 to 60       4.6
  Other                                    Less than 1                  Less than 1

¹ From Stanford Achievement Test: Technical Data Report, p. 21.



groups data, the groups contained only continental U.S. students. Likewise, the test makers most probably did not take Virgin Islands public school curriculum into account when designing items. Therefore, before the scores of any tests of basic skills can be used to draw conclusions about V.I. students, the content validity and reliability of these test scores for Virgin Islands students must be established. Hence, this report.

Method

Sampling

The June 1, 1979 enrollment in the public schools in the Virgin Islands of the United States was 25,426 according to the statistics issued by the V.I. Department of Education. It was clear that testing this number of students was economically unfeasible. The preferred alternative would have been to generate a random sample of students in grades K-12 to be tested, but it was equally clear that this would have produced an intolerable disruption of classroom activities. Therefore, in an attempt to obtain a representative sample of students, cluster sampling was used with the clusters being defined as classes. The number of classes to be selected for the sample from each grade in each of the St. Thomas/St. John and St. Croix districts was determined by calculating the proportion of the total K-12 student population in each grade in each district and assuming a class size of thirty.
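
The allocation rule just described can be sketched as follows. The report does not give the exact rounding rule or the per-grade enrollment figures, so this is only one plausible reading: each stratum (a grade within a district) receives a share of the target sample proportional to its enrollment, and that share is converted to whole classes at thirty students per class. The function name `classes_per_stratum` and the enrollment figures in the usage note are hypothetical.

```python
def classes_per_stratum(enrollment, target_sample, class_size=30):
    """Allocate whole classes to each (district, grade) stratum.

    enrollment: dict mapping a stratum label to its enrolled student count.
    target_sample: total number of students the survey hopes to test.
    class_size: assumed students per class (thirty in the report).
    """
    total = sum(enrollment.values())
    allocation = {}
    for stratum, count in enrollment.items():
        share = count / total  # proportion of the whole population
        # Students allotted to the stratum, converted to whole classes.
        allocation[stratum] = round(target_sample * share / class_size)
    return allocation
```

For example, with made-up enrollments of 1,200 and 800 students in two strata and a target of 600 students, the sketch assigns 12 and 8 classes respectively.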

Selecting whole classes presented an additional difficulty. The small number of classes selected in each grade might have made obtaining a representative sample of students more difficult. This is due to the fact that while classes in a given elementary school may be heterogeneous, the schools themselves are not. This is because elementary schools in the U.S. Virgin Islands are essentially neighborhood schools. Virgin Islands neighborhoods tend to be homogeneous in terms of socioeconomic status of residents. To overcome this problem, it was decided to increase the number of classes tested in a given grade (thereby increasing the number of schools within the territory from which these classes came) without increasing the total number of students tested by testing at alternate grades. This seemed acceptable since many of the objectives tested by the Stanford Achievement Test carry across adjacent levels of the test and there was no reason to suspect that the patterns of academic achievement of students in odd numbered grades were different from those in even numbered grades.

It was originally proposed that students in odd numbered grades be tested during the spring of 1980, but difficulties in obtaining testing materials resulted in testing being postponed until the Fall of 1980. In order to deal with the cohort of students originally selected, even numbered grades were actually tested.

The classes to be tested were chosen by chance. Specifically, for each grade in each district a listing of classes was made and each class was assigned a number. A table of random numbers was consulted. Numbers were drawn from the table until there were the same number of random numbers chosen as there were classes needed for the sample. In the case of duplicate numbers being drawn, the duplicate was ignored and another number chosen. If the number chosen was outside the range of the number of classes on the list, it was ignored and another number was chosen. When sufficient numbers had been drawn, the listed classes which corresponded to these numbers were included in the sample. This procedure was repeated for each grade in each district.
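
The drawing procedure above amounts to rejection sampling without replacement, and can be sketched as below. A pseudorandom generator stands in for the printed table of random numbers, which is an assumption of this sketch; the function name `select_classes` and its parameters are hypothetical.

```python
import random


def select_classes(class_list, n_needed, draw=None, max_number=99):
    """Pick n_needed classes by drawing numbers one at a time.

    class_list: the numbered listing of classes (class 1 is class_list[0]).
    draw: a zero-argument callable returning the next number; by default a
          pseudorandom integer from 1 to max_number stands in for the table.
    """
    if n_needed > len(class_list):
        raise ValueError("more classes requested than are on the list")
    if draw is None:
        rng = random.Random()
        draw = lambda: rng.randint(1, max_number)

    chosen = []
    while len(chosen) < n_needed:
        number = draw()
        if number < 1 or number > len(class_list):
            continue  # outside the range of the list: ignore, draw again
        if number in chosen:
            continue  # duplicate: ignore, draw again
        chosen.append(number)

    # Map the accepted numbers back to the listed classes.
    return [class_list[number - 1] for number in chosen]
```

Passing a fixed sequence as `draw` reproduces a reading from a printed table: with five listed classes and the draws 7, 2, 2, 5, 1, the 7 is out of range and the second 2 is a duplicate, so classes 2, 5, and 1 are selected.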

The sole exception to this procedure was in the eighth grade portion of the sample. On St. Thomas homeroom classes are somewhat homogeneous in that students repeating eighth grade and those in the eighth grade for the first time are placed in separate homeroom classes. Since the levels of academic achievement for repeaters and nonrepeaters are very likely different, the proportion of repeaters and nonrepeaters was determined in order to arrive at the number of classes needed in the sample from each group, and the groups of classes of repeaters and nonrepeaters were sampled separately in the manner described in the preceding paragraph.

Elena Christian Junior High School on St. Croix is on split session. The principal of that school felt that there were definite differences in achievement levels between the students in the morning and afternoon sessions. Because of this, classes in the morning and afternoon sessions were sampled separately using the same procedure employed on St. Thomas for the repeating and nonrepeating homeroom classes.

If simple random sampling had been used in selecting students to be tested, a sample size of approximately 2000 would have been the maximum size required to obtain an accuracy of about ±2% at a .95 level of confidence when estimating the proportion of V.I. students reaching certain objectives from the sample proportions, if a typical proportion answering each item correctly were .50 (see Asher, 1979, p. 166). In actuality, due to student absences, failure of school personnel to carry out requested tasks, and other difficulties, the sample size obtained was only 1535. However, examination of the difficulty indexes of the Stanford Achievement Test items on all levels revealed difficulty indexes considerably different from .50 on most items. This would tend to shorten the size of the confidence interval. Finally, the financial and organizational constraints cited previously forced the investigators to use cluster sampling techniques rather than random sampling. Since the intraclass correlations (i.e. the effects of clustering on the standard deviations of the achievement test scores) were not known, this factor also contributes toward making the above mentioned accuracy estimate a rather crude one. It can, however, serve as a rough guideline.
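
The order of magnitude of these figures can be checked with the usual normal-approximation formulas for a proportion under simple random sampling. This sketch is not taken from Asher (1979); it assumes z = 1.96 for the .95 level and applies the standard finite population correction using the enrollment of 25,426 reported earlier. The function names are hypothetical.

```python
from math import sqrt


def sample_size_for_proportion(margin, p=0.5, z=1.96, population=None):
    """Simple-random-sample size needed for a given margin of error.

    Returns a float; round up to get a whole number of students.
    If population is given, the finite population correction is applied.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is None:
        return n0
    return n0 / (1 + (n0 - 1) / population)


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion (no correction)."""
    return z * sqrt(p * (1 - p) / n)
```

Under these assumptions, a ±2% margin at p = .50 calls for roughly 2,400 students from an unlimited population and about 2,190 from a population of 25,426, the same order as the figure of approximately 2000 cited in the text. The achieved sample of 1535 corresponds to a margin of about ±2.5% under simple random sampling, and any p other than .50 gives a smaller required sample, which is the sense in which difficulty indexes far from .50 shorten the interval. None of these figures reflects the effect of cluster sampling.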

Table 2 presents the relevant sample size data. The sixth and second grade samples from St. Croix are smaller than had been hoped for the following reasons. As indicated previously, the teacher of one of the sixth grade classes only administered four of the seven subtests. In a second grade class, the teacher was ill during the days set aside for testing and the test was not administered. By the time this became apparent to the investigators, it was too late to go back to St. Croix to retest.

Aside from the difficulty in estimating precision of the proportions of students obtaining correct scores on various items, the sampling procedure used presents another difficulty. Because of the previously stated practical considerations, it was necessary to employ cluster sampling (sampling whole classes) rather than simple random sampling of students to be tested.



Table 2

U.S. Virgin Islands Sample Sizes

         Test               Total     St. Thomas/St. John   St. Croix
Grade    Level              System    District              District

12       TASK II            129       74                    55
10       TASK I             254       167                   87
8        Advanced           345       173                   172
6        Intermediate II    227       146                   81
4        Primary III        346       186                   160
2        Primary I          234       143                   91

TOTAL                       1535      889                   646



The principal drawback to cluster sampling is the likelihood of increased sampling error. In general, as the size of the sample increases, the size of the standard error decreases. This applies, however, when each sample element (in this case, each student) is selected independently of every other element. In cluster sampling the elements are, by definition, selected in a group rather than independently. The effect of clustered selection on the standard error will depend on the similarity between the elements in the cluster and those in the population. In many cases, sample elements selected in clusters will not show the same variation as an equivalent number selected independently. Students who attend the same school and are in the same class may be more like one another in a characteristic such as academic achievement than students in the public school population as a whole.

The relationship between clustering and sampling error may be summarized as follows. If all the elements (students) in a cluster (class) were identical with regard to achievement and totally different from the elements in other clusters, the sampling error would be extremely high. Clustering, in this case, would tend to make the clustered sample equivalent in size to a simple random sample with as many subjects as there are clusters, rather than elements. Hence, a sample made up of 60 clusters might be equivalent to a simple random sample of 60 individuals. This is obviously an extreme case that is never seen in practice. At the opposite extreme would be a series of clusters showing the same variation within each cluster as simple random samples of the same size. In this case, each cluster would represent the entire population, another condition rarely met in practice. Most sampling situations fall in between these extremes, tending toward one or the other according to the characteristic being studied. In general, according to Warwick and Lininger (1975), experience has shown that well-designed cluster samples will produce standard errors that are about one and one-half times as large as the standard errors from simple random samples of the same size.
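
The two extremes and the one-and-one-half rule of thumb can be expressed numerically. The sketch below uses the widely cited design effect approximation, deff = 1 + (m - 1) * rho, where m is the cluster size and rho the intraclass correlation; this formula is brought in for illustration and is not cited in the report, and the function names are hypothetical. Since the standard error scales with the square root of the design effect, a standard error ratio of 1.5 corresponds to a design effect of 2.25.

```python
def design_effect(cluster_size, intraclass_correlation):
    """Approximate variance inflation from sampling whole clusters."""
    return 1 + (cluster_size - 1) * intraclass_correlation


def effective_sample_size(n, se_ratio):
    """Size of the simple random sample giving the same standard error.

    se_ratio: clustered standard error divided by the simple random
              sampling standard error for a sample of the same size.
    """
    return n / se_ratio ** 2
```

With classes of thirty, an intraclass correlation of 1 gives a design effect of 30, so the sample is worth only as many observations as there are classes, while a correlation of 0 gives a design effect of 1 and no loss. Applying the ratio of 1.5 to the 1535 students tested gives an effective sample of roughly 680, a figure that depends entirely on the rule of thumb holding for these data.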

This situation should not have any effect on the descriptive statistics reported in this document, but it will enter into the interpretation of the results of hypothesis testing using parametric techniques since these latter techniques rely on estimates of the standard error. These will be discussed as the results of these tests are dealt with. In general, however, the resulting under-estimates of the standard error of the means will result in test statistics that are higher than they would have been if clustered standard error estimates had been used.

Notwithstanding the difficulties involved in sampling, the researchers are confident that the resulting samples are as representative of the entire V.I. public school population as is really possible given the nature of working with human subjects and the organizational considerations of schools combined with the resources available for the study. The difficulties encountered are not untypical of those commonly found when doing field work in both public and private schools.

Testing Procedure

Testing was done at the grade level recommended by the test publisher. Table 3 indicates the subtests of the battery given to each grade. This was primarily done to insure the content validity of the examinations. Tests were administered by classroom teachers or guidance counselors, at the discretion of building administrators. Each person who was to administer tests attended a two hour training session at either the St. Thomas or the St. Croix campus of the College of the Virgin Islands. During this time the purpose of the testing was explained, the test and instruction manual were reviewed, a testing schedule was distributed and reviewed, and testing materials were distributed. These included a practice test for each of grades 2, 4, and 6. This was to be given to students the day prior to the first day of testing in order to give them practice in reading and answering multiple choice standardized tests.




Tests were administered in the St. Thomas/St. John district
during the week of October 21, 1980 and in the St. Croix district
during the week of December 1, 1980. Testing materials and completed
answer sheets were collected, answer sheets were checked to
determine compliance with marking instructions, and answer
documents were sent to the Psychological Corporation of Iowa City,
Iowa to be machine scored.






Table 3

Stanford Subtests Administered
at Each Grade Level

Grade                       Subtest                     Number of Items

Grade 12                    Reading                     78
(Task II Level)             Mathematics                 48
                            English                     69

Grade 10                    Reading                     78
(Task I Level)              Mathematics                 48
                            English                     69

Grade 8                     Vocabulary                  50
(Advanced Level)            Reading Comprehension       74
                            Mathematics Concepts        35
                            Mathematics Computation     45
                            Mathematics Applications    40
                            Spelling                    60
                            Language                    [illegible]

Grade 6                     Vocabulary                  50
(Intermediate II Level)     Reading Comprehension       71
                            Mathematics Concepts        35
                            Mathematics Computation     45
                            Mathematics Applications    40
                            Spelling                    60
                            Word Study Skills           50
                            Language                    80

Grade 4                     Vocabulary                  45
(Primary Level III)         Reading Comprehension       70
                            Word Study Skills           55
                            Mathematics Concepts        32
                            Mathematics Computation     36
                            Mathematics Applications    28
                            Spelling                    47
                            Language                    55

Grade 2                     Vocabulary                  37
(Primary Level I)           Reading Comprehension       87
                            Word Study Skills           60
                            Mathematics Concepts        32
                            Mathematics Computation     32
                            Listening Comprehension     26






Results 






Table 4 provides descriptive statistics using the raw scores
(number of items correct) of students on each subtest of the
Stanford Achievement Test.

Content Validity

The content validity of the various levels of the Stanford
Achievement Test used to collect data on basic skills achievement
was determined by using the following strategies:

1) Collection of written curriculum guides used in the public
   schools. The objectives explicitly stated or implicitly
   inferred in these documents were compared with the lists
   of objectives tested provided by the test publisher.

2) Textbooks used in the teaching of basic skills subject
   matter were collected from selected schools. Stated and
   implicit objectives in these texts were compared with the
   test publisher's objectives.

3) The test objectives were shown to elementary and secondary
   subject area supervisors, who were asked to determine the
   degree of match between those objectives and what is taught
   in the public schools at the indicated grade levels.

4) Selected building principals in St. Thomas were asked to
   review the objectives of the test and give their opinions
   concerning the degree of match between these objectives
   and the objectives taught toward in the classes in their
   schools.




Table 4

Descriptive Statistics of Stanford
Achievement Test Raw Scores

                            U.S.V.I. System           STT/STJ                   STX
Test                        Mean         Stand. Dev.  Mean         Stand. Dev.  Mean         Stand. Dev.

Grade 12-Task II Level
Reading                     43.9         13.8         40.6         12.3         48.3         15.2
Mathematics                 25.3         8.3          24.6         7.4          26.3         9.4
English                     46.8         11.2         45.6         10.9         48.4         11.5

Grade 10-Task I Level
Reading                     45.6         14.0         43.6         14.2         48.5         14.3
Mathematics                 32.0         14.6         [illegible]  16.9         [illegible]  3.6
English                     48.0         12.0         47.6         10.8         49.0         14.0

Grade 8-Advanced Level
Vocabulary                  21.0         7.2          20.7         6.6          21.2         8.5
Reading Comprehension       31.5         15.4         32.6         17.4         30.5         13.1
Mathematics Concepts        15.3         5.8          16.4         6.0          14.2         5.3
Mathematics Computation     23.0         7.6          23.3         7.1          22.8         8.2
Mathematics Applications    16.9         6.6          17.7         6.6          16.1         6.5
Spelling                    31.8         12.3         32.8         11.8         30.8         12.9
Language                    35.6         12.1         35.8         10.6         34.6         13.6

Grade 6-Intermediate II Level
Vocabulary                  21.6         7.8          22.6         8.1          19.7         6.8
Reading Comprehension       32.5         12.4         31.8         12.7         33.5         11.5
Word Study Skills           28.6         11.1         29.4         11.5         27.1         10.3
Mathematics Concepts        18.2         5.7          19.3         5.5          16.4         5.5
Mathematics Computation     25.0         7.4          24.8         8.0          25.4         6.3
Mathematics Applications    18.4         8.0          19.2         8.0          16.9         7.8
Spelling                    35.2         13.6         35.5         14.2         34.7         12.5
Language                    37.4         13.8         37.9         15.0         37.0         11.5

Grade 4-Primary III Level
Vocabulary                  23.6         7.2          23.4         5.9          24.0         8.4
Reading Comprehension       42.3         11.5         42.2         11.3         42.3         11.8
Word Study Skills           29.8         10.0         30.8         9.2          [illegible]  10.8
Mathematics Concepts        15.5         5.3          15.0         4.4          16.0         6.2
Mathematics Computation     20.4         6.1          19.6         5.1          21.4         7.0
Mathematics Applications    13.8         5.8          13.8         5.5          13.8         6.0
Spelling                    30.9         10.2         30.4         9.2          31.5         11.2
Language                    28.9         8.8          28.1         8.1          29.8         9.4

Grade 2-Primary I Level
Vocabulary                  21.7         5.1          22.7         4.8          20.1         5.1
Reading (Part A)            34.1         13.8         36.3         15.2         30.5         10.3
Reading (Part B)            29.5         8.9          30.7         8.4          27.6         9.5
Word Study Skills           47.5         9.4          48.7         8.7          45.6         10.1
Mathematics Concepts        19.0         4.4          19.7         4.3          18.7         5.6
Mathematics Computation     21.5         5.0          22.0         4.6          20.7         5.4
Listening Comprehension     16.8         4.3          17.9         4.0          15.2         4.4






5) Teachers who administered the tests in their classrooms
   were asked to review the test publisher's
   objectives and to determine the degree of match
   between these objectives and the basic skills they
   expected their students to have obtained.

Using these techniques, the researchers were satisfied
that the test did, indeed, test a sample of objectives that
was consistent with the objectives used in teaching in the
public schools of the Virgin Islands of the United States.

Reliability

The estimates of reliability of the test scores are presented
in Table 5. The KR-20 reliability estimate² for each
test is reported along with the KR-20 estimate for the mainland
standardization samples as presented in the Technical Data Report.
The issue of interpreting these reliability estimates is a complex
one and will be dealt with in more detail at the conclusion of
this report. The author felt the need to have at least a
tentative criterion for making decisions regarding the acceptability
of the reliability estimates obtained from the V.I.
sample of examinees. The Stanford Achievement Test is considered
to have more than acceptable reliability when administered to
the population of examinees upon which it was standardized
(i.e., continental U.S. students). Among the indications of
this are numerous reviews of the test in the literature
(Kasdon, 1974; Lehmann, 1975; Chase, 1978; Ebel, 1978; Thorndike,
1978), and the fact that it is widely used in the schools.
However, the literature is replete with studies which indicate
that standardized tests of academic achievement tend to produce
less reliable scores when administered to students from low
socioeconomic status homes and to those who are culturally
different from the majority of those on whom the test was
normed (see reviews and discussions in Anastasi, 1958; Tyler,
1956; and Deutsch, 1960). Therefore, if the reliability
estimates obtained from a sample of U.S. Virgin Islands students
who took the Stanford Achievement Test are at least equal to
the reliability estimates obtained from the standardization
samples, it is reasonable to conclude that the test scores are
reliable indicators of academic achievement for these students.

² $r_{xx} = \left[\dfrac{n}{n-1}\right]\left[\dfrac{\sigma_x^2 - \sum pq}{\sigma_x^2}\right]$
(From Stanford Achievement Test: Technical Data Report, p. 35)

where $r_{xx}$ = the reliability estimate
      $n$ = the number of scores
      $\sigma_x^2$ = the variance of the distribution of scores
      $p$ = the proportion of examinees marking the correct answer on a particular item
      $q$ = $1 - p$
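As an illustration of the KR-20 estimate defined in footnote 2, the following is a minimal sketch, not the scoring service's procedure. It assumes that items are scored 0 or 1, that n is taken as the number of items, and that the variance of total scores is the population variance (dividing by the number of examinees). The function name and the tiny response matrix are hypothetical.

```python
def kr20(responses):
    """KR-20 reliability for a list of examinee rows of 0/1 item scores."""
    n_examinees = len(responses)
    n_items = len(responses[0])

    # Total raw score (number of items correct) for each examinee.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_examinees

    # Population variance of the total scores (assumed divisor: N).
    variance = sum((t - mean_total) ** 2 for t in totals) / n_examinees
    if variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")

    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * ((variance - sum_pq) / variance)


# Hypothetical data: four examinees, three items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(example), 2))
```

With real subtests the same calculation would run over every examinee's item-level responses; only raw-score summaries are reported in this document, so the example data are invented.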

For each reliability estimate obtained from the V.I. sample,
a reliability difference was found by subtracting the standardization
groups' reliability estimate from the local groups'
reliability estimate. The distribution of these differences
is shown by the histogram in Figure 1. The median reliability
difference was -.038, with a range from -.20 to +.05 and the
distribution skewed quite markedly to the left (i.e. negatively).
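The difference calculation described above amounts to a subtraction followed by summary statistics. The sketch below assumes three illustrative pairs of estimates patterned on the Grade 12 system rows of Table 5; the function name is hypothetical, and the full analysis would use all 108 pairs.

```python
from statistics import median


def reliability_differences(standardization, local):
    """Local estimate minus standardization estimate, rounded to two places."""
    return [round(loc - std, 2) for std, loc in zip(standardization, local)]


# Illustrative pairs only (three of the 108 comparisons).
standardization_kr20 = [0.94, 0.94, 0.94]
local_kr20 = [0.93, 0.91, 0.87]

diffs = reliability_differences(standardization_kr20, local_kr20)
print(diffs)                    # negative values mean the local estimate is lower
print(median(diffs))            # middle value of the differences
print(min(diffs), max(diffs))   # range of the differences
```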

In addition, in an attempt to observe these reliability
differences from another perspective, for each pair of reliability
estimates (the standardization group estimate and the



Table 5

Stanford Achievement Test Raw
Score Reliability Estimates

                            STAND.       USVI         ST THOMAS/
                            GROUPS       SYSTEM       ST JOHN      ST CROIX
TEST                        KR-20        KR-20        KR-20        KR-20

Grade 12-TASK II Level
Reading                     .94          .93          .91*         .95
Mathematics                 .94          .91*         .90*         .91
English                     .94          .87*         .84*         .90

Grade 10-TASK I Level
Reading                     .95          .93*         .94          .94
Mathematics                 .94          .98          .99          .89*
English                     .95          .92*         .90*         .95

Grade 8-Advanced Level
Vocabulary                  .89          .81*         .78*         .87
Reading Comprehension       .94          .95          .97          .93
Mathematics Concepts        .86          .79*         .81*         .75*
Mathematics Computation     .89          .85*         .83*         .87
Mathematics Application     .91          .83*         .83*         .82*
Spelling                    .94          .93          .92*         .93
Language                    .94          .88*         .84*         .91*

Grade 6-Intermediate II Level
Vocabulary                  .90          .85*         .86*         .80*
Reading Comprehension       .94          .93          .94          .91
Word Study Skills           .95          .93*         .94          .92*
Mathematics Concepts        .85          .79*         .78*         .77*
Mathematics Computation     .90          .85*         .88          .78*
Mathematics Application     .92          .89*         .89*         .89
Spelling                    .94          .94          .95          .93
Language                    .94          .92*         .93          .87

Grade 4-Primary III Level
Vocabulary                  .88          .83*         .75*         .88
Reading                     .96          .91*         .91*         .92*
Word Study Skills           .94          .90*         .88*         .92
Mathematics Concepts        .86          .77*         .66*         .84
Mathematics Computation     .87          .83*         .75*         .87
Mathematics Application     .92          .86*         .84*         .87*
Spelling                    .93          .93          .91*         .94
Language                    .92          .86*         .84*         .88*

Grade 2-Primary I Level
Vocabulary                  .86          .72*         .71*         .71*
Reading Part A              .94          .98          .99          .94
Reading Part B              .95          .92*         .91          .92*
Word Study Skills           .93          .92          .91          .93
Mathematics Concepts        .81          .71*         .69*         .74*
Mathematics Computation     .87          .81          .78*         .83*
Listening Comprehension     .77          .74          .70*         .72

*Significantly lower than the standardization groups' KR-20
at p=.05






V.I. sample estimate), the hypothesis that the differences in
reliability obtained were less than zero was tested. Reliability
estimates were transformed using Z transformations³ to normalize
the skewness of the distribution of the correlation measures,
and the hypothesis was tested with t-tests. One-tailed significance
tests were used. As indicated by Table 5, 64 of 108 comparisons
show lower reliability in the V.I. sample (i.e. differences
less than zero).

A note of caution is in order in interpreting the results
of these tests of significant differences. As previously
pointed out, cluster sampling was used in obtaining the sample
of Virgin Islands students to be tested rather than simple random
sampling. The result of this is that the actual standard error
of the sample may very well be larger than the one used in calculating
the t statistic (estimated by $[1/(N-3)]^{1/2}$). The result
of this would be that the values of t obtained were larger than
they should have been and that some of the differences from
zero that were noted in Table 5 to be significant at the p = .05
level may actually not have been. To put it in technical terms,
the probability of Type I error is probably greater than .05 in
each of these hypothesis tests. This is a definite weakness
in any conclusions we might draw from these tests. However,
from a practical point of view, given the decisions to be made,
Type I error is the error of preference. That is, the consequences
of mistakenly assuming that scores are less reliable
for V.I. students would be that we would either look more closely at
the tests from which the scores came or discard the results of
the testing as being unreliable for V.I. students. In this case
what is lost is much time and, possibly, some money. On the
other hand, the consequence of Type II errors (mistakenly
assuming that local scores are at least as reliable as the
standardization groups' scores) would be to go ahead and use
the unreliable scores to make decisions about basic skills
levels of V.I. students and, possibly, to make decisions regarding
instructional strategies that will be used in the schools.
In essence, then, what results is a rather liberal test of the
hypotheses and, given the nature of the decisions to be made,
this may not be totally undesirable. However, it must be kept
in mind when interpreting these results that the actual level
of Type I error is not known and that it is probably higher
than .05. In any event, we can use Table 5 to flag tests where
reliabilities may be less than acceptable.

³ $Z = \dfrac{1}{2}\log_e\left(\dfrac{1 + r_{xy}}{1 - r_{xy}}\right)$

  $t = \dfrac{Z - E(Z)}{\sqrt{1/(N-3)}}$

  (Hayes, 1973, pp. 662-667)

Figure 1

Frequency Distribution of Differences Between the Standardization
Group Reliability Estimates and the V.I. Sample
Reliability Estimates (Δr)

[Histogram: frequency by Δr]
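The transformation and test statistic given in footnote 3 can be sketched as follows. This is an illustration under stated assumptions: E(Z) is taken to be the transformed standardization-group estimate, the sample size of 103 is hypothetical (sample sizes per subtest are not given in this section), and the function names are invented for the example.

```python
import math


def fisher_z(r):
    """Z = 0.5 * ln((1 + r) / (1 - r)) for a coefficient r strictly between -1 and 1."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def reliability_test_statistic(r_local, r_standard, n):
    """(Z_local - Z_standard) divided by the standard error sqrt(1 / (n - 3))."""
    standard_error = math.sqrt(1.0 / (n - 3))
    return (fisher_z(r_local) - fisher_z(r_standard)) / standard_error


# Hypothetical comparison: local estimate .87, standardization estimate .94, n = 103.
statistic = reliability_test_statistic(0.87, 0.94, 103)
print(round(statistic, 2))

# A one-tailed test at the .05 level looks for a clearly negative value.
# Because of the cluster sample, the statistic is likely overstated; dividing it
# by an assumed ratio such as 1.5 gives a more cautious reading.
print(round(statistic / 1.5, 2))
```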

It was noted that, in the majority of cases, the variances
of the raw scores obtained by the V.I. sample were considerably
lower than those reported for the standardization groups. This
homogeneity is a phenomenon commonly found when testing samples
drawn from populations composed largely of persons from low
socioeconomic status homes. "The reliability of any test is
partially dependent on the sample of individuals tested to
obtain the coefficient. In general, the more heterogeneous
the sample with respect to whatever the test is measuring, the
higher the reliability coefficient will be" (Technical Data
Report, p. 35). The standardization groups' reliability coefficients
can be adjusted for homogeneity using the variances
obtained from the local sample, making reliability comparisons
more meaningful.⁴
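One common way to carry out such an adjustment is sketched below. It assumes that the error variance is the same in both groups, so that a reliability of r in a group with variance V1 corresponds to 1 - (V1 / V2)(1 - r) in a group with variance V2; this is an interpretation of the adjustment referenced in footnote 4, not a reproduction of the author's computation. The function name and the standardization standard deviation of 9.35 are hypothetical.

```python
def adjust_reliability(reliability, variance_reference, variance_target):
    """Reliability expected in a group with variance_target, given the reliability
    observed in a group with variance_reference and equal error variance."""
    return 1.0 - (variance_reference / variance_target) * (1.0 - reliability)


# Hypothetical example: standardization KR-20 of .89 with a standard deviation
# of 9.35, adjusted to a local sample whose standard deviation is 7.2.
standard_variance = 9.35 ** 2
local_variance = 7.2 ** 2
adjusted = adjust_reliability(0.89, standard_variance, local_variance)
print(round(adjusted, 2))
```

When the local sample is more homogeneous than the standardization group, the adjusted coefficient is lower than the published one, which is why the adjusted comparisons show smaller differences.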

Using the adjusted reliability estimates for the standardization
groups, the differences between these groups' and the
V.I. sample's reliability estimates were calculated employing
the same procedure used with the unadjusted estimates. The
distribution of these differences is shown in the histogram in
Figure 2. The median reliability difference using the adjusted
estimates was -.002 with a range from -.06 to .02. As with the
unadjusted estimates, the distribution is negatively skewed, but
not as markedly as with the unadjusted reliability estimate
differences. When the standardization groups' reliability
estimates are adjusted for homogeneity, the differences between
the reliability estimates of the two samples become fewer and
smaller.

Table 6 presents the adjusted estimates of reliability for
the standardization groups and the results of tests of the
hypotheses that the differences between the standardization
groups' reliability estimates and the estimates of reliability



⁴ Using the formula $\rho_{x^*y^*} = 1 - \left(\sigma_x^2 / \sigma_{x^*}^2\right)\left(1 - \rho_{xy}\right)$,
where $\rho_{xy}$ and $\sigma_x^2$ are, in this case, the reliability coefficient
and variance of the standardization groups and $\rho_{x^*y^*}$ and $\sigma_{x^*}^2$ are
the same statistics for the V.I. sample.
(Hayes, 1973)




Figure 2

Frequency Distribution of Differences Between the
Standardization Group Adjusted Reliability
Estimates and the V.I. Sample
Reliability Estimates (Δr)

[Histogram: frequency (0 to 50) by Δr (-.06 to .02)]



Table 6

Adjusted Stanford Achievement Test Raw Score Reliability Estimates

                            USVI SYSTEM               ST THOMAS/ST JOHN         ST. CROIX
                            Adj. Stand.  Local        Adj. Stand.  Local        Adj. Stand.  Local
                            Groups       Sample       Groups       Sample       Groups       Sample
TEST                        KR-20        KR-20        KR-20        KR-20        KR-20        KR-20

Grade 12 - TASK II Level
Reading                     .93          .93          .91          .91          .94          .95
Mathematics                 .88          .87          .85          .84          .91          .90
English                     .91          .91          [illegible]  .90          .91          .91

Grade 10 - TASK I Level
Reading                     .93          [illegible]  .94          .94          .94          .94
Mathematics                 .97          .98          .97          .99          .90          .89
English                     .93          .92          .92          .90          .95          .95

Grade 8 - Advanced Level
Vocabulary                  .82          .81          .79          .78          .87          .87
Reading Comprehension       .94          .95          .95          .97          .92          .93
Mathematics Concepts        .79          .79          .80          .81          .75          .75
Mathematics Computation     .85          .85          .83          .83          .87          .87
Mathematics Application     .83          .83          .83          .83          .83          .82
Spelling                    .93          .93          .92          [illegible]  .93          .93
Language                    .90          .88*         .86          .84          .92          .91

Grade 6 - Intermediate II Level
Vocabulary                  .86          .85          .87          .86          .81          .80
Reading Comprehension       .92          .93          .93          .94          .91          .91
Word Study Skills           .93          .93          .94          .94          .92          .92
Mathematics Concepts        [illegible]  .79          .78          .78          .77          .77
Mathematics Computation     .85          .85          .87          .88          .79          .78
Mathematics Application     .90          .89          .90          .89          [illegible]  .89
Spelling                    .94          .94          [illegible]  .95          .93          .93
Language                    [illegible]  .92          [illegible]  .93          .88          [illegible]

Grade 4 - Primary III Level
Vocabulary                  .84          .83          .76          .75          .88          .88
Reading Comprehension       .93          .91*         .92          .91          .93          .92
Word Study Skills           .91          .90          .89          .88          .92          .92
Mathematics Concepts        .78          .77          .68          .66          .84          .84
Mathematics Computation     .83          .83          .76          .75          .87          .87
Mathematics Application     .87          .86          .85          .84          .88          .87
Spelling                    .93          .93          .91          .91          .94          .94
Language                    .86          .86          .84          .84          .88          .88

Grade 2 - Primary I Level
Vocabulary                  .76          .72          .72          .71          .74          [illegible]
Reading Part A              .97          .98          .97          .99          .94          .94
Reading Part B              .93          .92          .92          .91          [illegible]  [illegible]
Word Study Skills           .91          .92          .89          .91          .92          .93
Mathematics Concepts        .72          .71          .70          .69          .74          .74
Mathematics Computation     .80          .81          .76          .78          .82          .83
Listening Comprehension     .78          .74          .73          .70          .78          .72

*Significantly lower than the standardization groups' KR-20 at p=.05



for the scores obtained by the V.I. sample are less than zero.
Again, the caution mentioned in discussing the hypothesis tests
using the unadjusted reliability estimates holds true. The
actual level of Type I error involved in these tests is not
really known and is most probably higher than .05. Even under
this condition, however, only 2 out of 108 comparisons showed
differences significantly less than zero. Since the .05 level
was formally used, these two differences could be expected as
a result of Type I error (i.e. as a result of chance). In fact,
slightly more than 5 differences significantly less than zero
would have been expected on a chance basis.
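The chance expectation quoted above is simple arithmetic: 108 comparisons at a nominal .05 level give 108 x .05 = 5.4 expected false rejections. The sketch below also computes a binomial probability, which assumes the 108 tests are independent; they are not strictly independent here, since the system scores combine the two districts, so that part is only a rough guide. The function names are hypothetical.

```python
from math import comb


def expected_false_rejections(n_tests, alpha):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_most(k, n_tests, alpha):
    """Binomial P(X <= k), assuming independent tests (an approximation here)."""
    return sum(
        comb(n_tests, i) * alpha ** i * (1.0 - alpha) ** (n_tests - i)
        for i in range(k + 1)
    )


print(expected_false_rejections(108, 0.05))   # about 5.4
print(round(prob_at_most(2, 108, 0.05), 3))   # chance of 2 or fewer rejections
```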

The standard errors of measurement⁵ for the raw scores of the
V.I. sample are shown in Table 7. ". . . when the reliability
of a test is interpreted in terms of the standard error of
measurement, the problem of the influence of heterogeneity
[or homogeneity] is taken into account, since the formula for
the standard error of measurement includes the standard deviation
of the scores" (Technical Data Report, p. 35). The
standard error of measurement can be thought of as the standard
deviation of the differences between the scores obtained
on the test and the true scores (the scores the examinees would
have received if the test were perfectly reliable). As such,
it can be used to determine an interval within which we can be
confident that the true score falls. For instance, we can be
confident that the true score would be within one standard

⁵ $SE = S\sqrt{1 - r_{xx}}$, where SE is the standard error of measurement,
S is the standard deviation of the scores, and $r_{xx}$ is the
reliability coefficient (Gronlund, 1976)



Table 7

Stanford Achievement Test
Standard Error of Measurement Estimates

                            STAND.       USVI         ST THOMAS/
                            GROUPS       SYSTEM       ST JOHN      ST CROIX
TEST                        S.E.M.       S.E.M.       S.E.M.       S.E.M.

Grade 12 - TASK II Level
Reading                     2.60         3.65*        3.69*        3.40*
Mathematics                 2.80         3.01         2.98         2.97
English                     3.30         3.36         3.45         3.44

Grade 10 - TASK I Level
Reading                     2.50         3.70*        3.48*        3.50*
Mathematics                 2.60         2.07         1.69         [illegible]
English                     3.10         3.38*        3.41*        3.70*

Grade 8 - Advanced Level
Vocabulary                  3.10         3.15         3.10         3.05
Reading Comprehension       3.60         3.45         3.02         3.46
Mathematics Concepts        2.60         2.65         2.61         2.67
Mathematics Computation     2.90         2.95         2.92         2.94
Mathematics Application     2.60         2.72         2.72         2.76
Spelling                    3.30         3.27         3.33         3.40
Language                    3.90         4.20*        4.23         4.07

Grade 6 - Intermediate II Level
Vocabulary                  3.0          3.02         3.05         3.05
Reading Comprehension       3.50         3.29         3.11         3.46
Word Study Skills           2.90         2.93         2.82         2.93
Mathematics Concepts        2.60         2.60         2.59         2.64
Mathematics Computation     2.90         2.88         2.78         2.94
Mathematics Application     2.60         2.63         2.64         2.60
Spelling                    3.30         3.33         3.18         3.30
Language                    4.0          3.91         3.98         4.14

Grade 4 - Primary III Level
Vocabulary                  2.80         2.96         2.95         2.91
Reading Comprehension       3.20         3.46*        3.38         3.35
Word Study Skills           3.00         3.16         3.17         3.06
Mathematics Concepts        2.40         2.55         2.55         2.47
Mathematics Computation     2.50         2.52         2.55         2.47
Mathematics Application     2.0          2.16*        2.21*        2.18
Spelling                    2.70         2.69         2.76         2.74
Language                    3.10         3.28*        3.26         3.25

Grade 2 - Primary I Level
Vocabulary                  2.50         2.68         2.57         2.72
Reading Part A              2.50         1.95         1.52         2.51
Reading Part B              2.40         2.53         2.52         2.68
Word Study Skills           2.80         2.65         2.62         2.68
Mathematics Concepts        2.30         2.38         2.38         2.37
Mathematics Computation     2.20         2.17         2.15         2.21
Listening Comprehension     2.0          2.22*        2.17         2.34*

*Significantly higher than the Standardization Groups at the p=.05
level.




error of the score the student actually received (the observed
score) around 68% of the time. The true score would be within
two standard errors of the observed score approximately 96% of
the time. Naturally, the lower the standard error of measurement,
the more reliable the scores.
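The standard error of measurement in footnote 5 and the score bands described above can be sketched as follows. The example uses the Grade 12 Reading system values reported in Tables 4 and 5 (standard deviation 13.8, KR-20 of .93); the observed score of 44 and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(standard_deviation, reliability):
    """SE = S * sqrt(1 - r_xx)."""
    return standard_deviation * math.sqrt(1.0 - reliability)


def true_score_band(observed_score, sem, n_errors=1):
    """Interval of n_errors standard errors on either side of the observed score."""
    return (observed_score - n_errors * sem, observed_score + n_errors * sem)


sem = standard_error_of_measurement(13.8, 0.93)
print(round(sem, 2))  # close to the 3.65 shown for this subtest in Table 7

# Hypothetical observed raw score of 44 items correct.
low, high = true_score_band(44.0, sem, n_errors=1)
print(round(low, 1), round(high, 1))   # band expected to hold the true score about 68% of the time
low2, high2 = true_score_band(44.0, sem, n_errors=2)
print(round(low2, 1), round(high2, 1)) # wider band of two standard errors
```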

χ² (chi-squared) tests⁶ were used to test the hypotheses
that the standard errors of measurement for the test scores in
the Virgin Islands sample were greater than those for the
standardization sample. Sixteen of the 108 tests show significantly
higher standard errors for the V.I. sample at the p=.05
level of significance. Nine of them occur in the high school
tests in the reading and English areas. These tests need to be
looked at closely. Among the remaining seven there seem to be
no patterns. It should be noted, however, that in two of these
cases, Mathematics Applications in grade 4 and Listening Comprehension
in grade 2, the differences are in one district and
in the total system scores. Since the total system scores are
obviously affected by the individual district scores, it is
possible that the large total system standard errors may be a
result of the lower reliability obtained from the district scores.



⁶ $\chi^2 / df = S^2 / \sigma^2$

where df is the number of degrees of freedom,
      $S^2$ is the square of the V.I. sample standard error of
      measurement,
      $\sigma^2$ is the square of the standardization groups' standard
      error of measurement.
(Darlington, 1973)
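A minimal sketch of the test in footnote 6 follows. It rearranges the formula to chi-squared = df x S² / σ² and compares the result with an upper-tail critical value. The critical value is obtained with the Wilson-Hilferty approximation, a substitute for a printed chi-squared table that is not mentioned in the source; the 100 degrees of freedom are hypothetical, and the function names are invented for the example.

```python
import math
from statistics import NormalDist


def variance_ratio_chi2(sem_local, sem_standard, df):
    """Chi-squared statistic: df * S^2 / sigma^2."""
    return df * (sem_local ** 2) / (sem_standard ** 2)


def chi2_critical_approx(df, alpha=0.05):
    """Approximate upper-tail critical value (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    term = 2.0 / (9.0 * df)
    return df * (1.0 - term + z * math.sqrt(term)) ** 3


# Grade 12 Reading, system column of Table 7: local S.E.M. 3.65 against 2.60.
df = 100  # hypothetical; the degrees of freedom are not given in this section
statistic = variance_ratio_chi2(3.65, 2.60, df)
critical = chi2_critical_approx(df, alpha=0.05)

print(round(statistic, 1), round(critical, 1))
print("significantly higher" if statistic > critical else "not significant")
```

As with the reliability tests, the cluster sample means that the nominal .05 level understates the true chance of a Type I error.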



Summary and Conclusions

The scores obtained from the testing of a representative
sample of U.S. Virgin Islands students using the 1973 edition
of the Stanford Achievement Test appear to be both content
valid and reliable. This is significant in that this test, and
all standardized tests of academic achievement published in the
United States, have been designed without including studies of
noncontinental U.S. public school curricula in the test planning
process or using noncontinental U.S. students in their
standardization studies.

It is clear that the test objectives, as stated by the publisher,
are a good match for those used in U.S. Virgin Islands
public schools. In addition, the reliability estimates of the
scores obtained from Virgin Islands students are, in most cases,
not significantly different from those obtained using the continental
U.S. standardization samples. At this point it may be
useful to examine the distinction between differences that are
"statistically significant" and those that are "educationally
significant." The statement that the difference between two values
is "statistically significant" implies that we are confident that
the difference between the two values is not zero. This is no
guarantee that the difference is not trivial. For instance, we may
weigh two packages on the same, very accurate, scale and find that
one weighs 25 kilograms while the other weighs 25.5 kilograms.
If we were trying to decide which of these packages to assign
each of two people to carry based on their relative strengths,
we could probably conclude that either person could carry
either package. The difference of one half of a kilogram was
real (i.e. nonzero), but it was so small that it was trivial.
Likewise, differences in reliability estimates noted in this
study may be statistically significant, but so small as to
allow us to conclude that the test scores were reliable enough
for us to use to make educational decisions (i.e. the differences
were not educationally significant). With the exception of the
grade 12 and grade 10 Reading test scores and the grade 10
English test scores from the St. Croix district, the differences
observed in standard error of measurement estimates seem to be
so small as not to be educationally significant.

Putting aside the question of the comparability of the obtained
reliability estimates between the standardization samples and the
U.S. Virgin Islands sample, the question of whether or not the
scores obtained from the U.S.V.I. sample are reliable enough for
us to use them to make educational decisions needs to be addressed.
"The degree of reliability we demand in our educational measures
depends largely on the nature of the decision to be made"
(Gronlund, 1976, p. 124). Standardized test results are used by
school personnel as one source of information for making instructional
and curricular decisions. Other sources of information
such as teacher-made classroom tests and observational techniques
are combined with the results of standardized tests before final
educational decisions are made in schools. Finally, this particular
study was designed to point out strengths and weaknesses in
basic skills areas in U.S. Virgin Islands public schools. Those
persons entrusted with the responsibility for making curricular
and instructional decisions in the Department of Education will
use this and other information before making changes in what
goes on in schools. Further, decisions made will always be
open to confirmation and change. Cronbach (1970) points out
that the reversibility of decisions made on the basis of test
data is an important factor to take into consideration in making
judgements concerning desired levels of reliability. The reliability
estimates obtained from the U.S.V.I. sample, which seem
to cluster from the middle .80's to the middle .90's, are more
than adequate to allow the confident use of the obtained scores.



References 



Anastasi, A. Differential Psychology (3rd ed.). New York:
Macmillan, 1958.

Asher, W.J. Educational Research and Evaluation Methods. Boston:
Little, Brown & Co., 1976.

Chase, C.I. Review of the Test of Academic Skills. In O.K. Buros
(Ed.), The Eighth Mental Measurements Yearbook (vol. 1).
Highland Park, N.J.: Gryphon Press, 1978.

Cronbach, L.J. Essentials of Psychological Testing (3rd ed.).
New York: Harper and Row, 1970.

Cronbach, L.J. & Meehl, P.E. Construct validity in psychological
tests. Psychological Bulletin, 1955, 52, 281-302.

Darlington, R.B. Radicals and Squares. Ithaca, N.Y.: Logan Hill
Press, 1975.

Deutsch, M. Minority group and class status related to
social and personality factors in scholastic achievement
(Monograph No. 2). Ithaca, N.Y.: The Society for Applied
Anthropology, 1960.

Ebel, R.L. Must all tests be valid? American Psychologist, 1961,
16, 640-643.

Ebel, R.L. Review of the 1973 edition of the Stanford Achievement
Test. In O.K. Buros (Ed.), The Eighth Mental Measurements
Yearbook (vol. 1). Highland Park, N.J.: Gryphon Press, 1978.

Ebel, R.L. Essentials of Educational Measurements (3rd ed.).
Englewood Cliffs, N.J.: Prentice-Hall, 1979.

French, J.W. & Michael, W.B. (Cochairmen) Standards for Educational
and Psychological Tests and Manuals. Washington, D.C.:
American Psychological Association, 1966.

Gronlund, N.E. Measurement and Evaluation in Teaching (3rd ed.).
New York: Macmillan, 1976.

Hayes, W.L. Statistics for the Social Sciences (2nd ed.). New
York: Holt, Rinehart, and Winston, 1973.

Kasdon, L.H. The Stanford Achievement Test (Review of the 1973
edition of the Stanford Achievement Test). Reading Teacher,
1974, 27, 743+.

Lehmann, I.J. The Stanford Achievement Test Series - 1973
(Review of the 1973 edition of the Stanford Achievement
Test). Journal of Educational Measurement, 1975, 12,
297-306.

Passow, A.H. Review of the 1973 edition of the Stanford Achievement
Test. In O.K. Buros (Ed.), The Eighth Mental Measurements
Yearbook (vol. 1). Highland Park, N.J.: Gryphon Press,
1978.

Stanford Achievement Test. Technical Data Report. New York:
Harcourt Brace Jovanovich, 1975.

Thorndike, R.L. Review of the Test of Academic Skills. In O.K.
Buros (Ed.), The Eighth Mental Measurements Yearbook (vol. 1).
Highland Park, N.J.: Gryphon Press, 1978.

Tyler, L. The Psychology of Individual Differences (2nd ed.).
New York: Appleton-Century-Crofts, 1956.

Warwick, D.P. & Lininger, C.A. The Sample Survey: Theory and
Practice. New York: McGraw-Hill, 1975.



